N90-16443 

HOW TO CLUSTER IN PARALLEL WITH NEURAL NETWORKS


Behzad Kamgar-Parsi
Center for Automation Research
University of Maryland
College Park, MD 20742


J. A. Gualtieri
Code 635
NASA GSFC
Greenbelt, MD 20771


Behrooz Kamgar-Parsi 
Dept. of Computer Science
George Mason University 
Fairfax, VA 22030 


Judy E. Devaney 
Science Applications Research 
4400 Forbes Blvd. 
Lanham, MD 20706 


ABSTRACT

Partitioning a set of N patterns in a d-dimensional met- 
ric space into K clusters - in a way that those in a given 
cluster are more similar to each other than the rest - 
is a problem of interest in astrophysics, image analysis 
and other fields. As there are approximately $K^N/K!$ possible
ways of partitioning the patterns among K clusters, finding
the best solution is beyond exhaustive search when N
is large. We show that this problem in spite of its expo- 
nential complexity can be formulated as an optimization 
problem for which very good, but not necessarily opti- 
mal, solutions can be found by using a neural network. 
To do this the network must start from many randomly 
selected initial states. The network is simulated on the 
MPP (a 128x128 SIMD array machine), where we use 
the massive parallelism not only in solving the differen- 
tial equations that govern the evolution of the network, 
but also by starting the network from many initial states 
at once thus obtaining many solutions in one run. We 
obtain speedups of two to three orders of magnitude over 
serial implementations, and the promise, through analog
VLSI implementations, of speedups commensurate with human
perceptual abilities.

Keywords: Combinatorial Optimization, Synchronous Analog Network,
Parallel Simulation, SIMD.

INTRODUCTION 

Problems that involve data analysis are becoming in- 
creasingly severe in that data sets are becoming very large 
and their rate of acquisition is growing rapidly. It is clear 
that humans possess immense computational power for 
solving certain problems through visualization and that 
what is needed is the development of algorithms that have 
some of these capabilities. 


The value of neural networks - whose development has 
been motivated by human beings’ computational capabil- 
ities - as a computational device is yet to be explored. In 
fact, little is known about the reliability and complexity 
of these algorithms, and how they scale with the size of 
the problem. The work we present in this paper is an 
attempt to answer some of these questions. For this, we 
will concentrate on the problem of data clustering - a 
problem of interest in astrophysics, image analysis and 
other fields. The conjecture is that because of the many 
connections among neurons, neural networks should be 
particularly useful for the class of problems that involve 
collective decision making, of which one example is un- 
supervised clustering. Here the patterns must decide to- 
gether how to partition themselves into subsets according 
to a given criterion. The problem considered here, as in 
all partitioning problems, is a discrete optimization with 
a goodness-of-fit criterion. By embedding this discrete 
problem in the continuous space of an analog network 
one can perform a downhill search on the energy surface 
which is more purposeful and effective than the search 
in the discrete space. Until hardware implementations of
analog neural networks in VLSI become available - which
is expected in the next few years [1] - simulation is going
to be an indispensable tool in the study and design of these
systems. Analog networks are intrinsically synchronous 
and hence well suited for simulation on massively parallel 
SIMD machines. 

In this paper, we simulate the neural net we propose for 
solving the clustering problem on the MPP [a 128x128 
SIMD array machine with 1024 bits of local memory per 
processor]. The issue of performance of neural net algo- 
rithms on parallel machines is also addressed. Before we 
proceed, however, we will discuss the clustering problem 
in some detail. 

THE CLUSTERING PROBLEM

By clustering we mean partitioning a set of N patterns 
(the patterns are represented as points in a d-dimensional 
metric space) into K clusters in a way that those in a 
given cluster are more similar to each other than the 
rest. As there are approximately $K^N/K!$ possible ways of
partitioning the patterns among K clusters [2], the problem
has exponential complexity and finding the best solution
is beyond exhaustive search. As is often done,
we let our criterion for best solution be the minimum
square-error. That is, representing the patterns by the
d-dimensional points $\{\mathbf{r}_i \mid i = 1, \ldots, N\}$, the best solution
is the one minimising

$$\chi^2 = \sum_{p=1}^{K} \sum_{i=1}^{N_p} \left( \mathbf{r}_i^{(p)} - \mathbf{R}_p \right)^2$$

with respect to $\{\mathbf{R}_p \mid p = 1, \ldots, K\}$. Here cluster p contains the
subset of the points $\{\mathbf{r}_i^{(p)}\}$, and its centroid is given by

$$\mathbf{R}_p = \frac{1}{N_p} \sum_{i=1}^{N_p} \mathbf{r}_i^{(p)},$$

where $N_p$ is the number of points in
the cluster. A partitioning based on such a criterion is
also known as minimum variance partitioning. Because of
the complexity of the problem, finding the best solution 
may not be possible. This, however, is not a major con- 
cern, because in practice usually only a good solution is 
sufficient. 
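
To give a feeling for the size of the search space, the following minimal sketch compares the exact number of ways to split N labelled patterns into K non-empty clusters (the Stirling number of the second kind) with the estimate $K^N/K!$. The function names `stirling2` and `approx_count` are illustrative choices, not part of the original work.

```python
from math import comb, factorial


def stirling2(n, k):
    """Exact number of ways to split n labelled items into k non-empty clusters."""
    # Inclusion-exclusion: S(n, k) = (1/k!) * sum_j (-1)^j * C(k, j) * (k - j)^n
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)


def approx_count(n, k):
    """Large-N estimate K^N / K! of the same count."""
    return k ** n / factorial(k)


if __name__ == "__main__":
    for n, k in [(10, 3), (128, 5)]:
        print(n, k, stirling2(n, k), approx_count(n, k))
```

Already for N = 10 and K = 3 there are 9330 distinct partitions, and the count grows exponentially with N, which is why exhaustive search is ruled out for data sets of realistic size.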

Due to the importance of this problem many meth- 
ods have been proposed by various researchers. (See Jain 
and Dubes [3] for a survey of the literature.) Many of 
these approaches are based on iterative schemes and of- 
ten the differences between the suggested algorithms are 
quite subtle. The number of clusters K may or may not 
be fixed. For a given value of K, the essence of iterative
algorithms is as follows.

After the initial partitioning of the patterns into K clusters,
their centroids, i.e. seed points in the d-dimensional
metric space of the patterns, are computed. Each pattern 
is then assigned to the cluster with the nearest seed point 
and new centroids are computed. The process is repeated 
until the partitioning ceases to change. However, the pro- 
cess of the computation of new centroids can be carried 
out in two ways: (i) Keep the centroids fixed until the
distances of all patterns to the K centroids are computed
and every pattern has been reassigned, and only then
recompute the centroids.
(ii) Update the centroids as soon as one pattern is
found to be closer to the centroid of a cluster other than
the one it is assigned to. In this case, the pattern is immediately
reassigned and the centroids of the winning and
the losing clusters are updated [5]. This method is sometimes
referred to as K-means. Note that for a parallel
machine, where the distances of the patterns from clus- 
ter centroids can be computed simultaneously, the first 
approach appears to be more efficient. 
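
The first (batch) variant can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the name `batch_cluster` is hypothetical, the seed points default to the first K input points, and a cluster that loses all its members simply keeps its previous centroid.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two d-dimensional points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Centroid of a non-empty list of d-dimensional points."""
    d = len(points[0])
    return tuple(sum(pt[k] for pt in points) / len(points) for k in range(d))


def batch_cluster(points, K, max_iter=100):
    """Batch iteration: reassign all patterns, then recompute all centroids."""
    centroids = list(points[:K])  # assumption: first K points serve as seeds
    labels = None
    for _ in range(max_iter):
        # Centroids stay fixed while every pattern is assigned to its nearest one.
        new_labels = [min(range(K), key=lambda p: sq_dist(pt, centroids[p]))
                      for pt in points]
        if new_labels == labels:  # partitioning ceased to change
            break
        labels = new_labels
        for p in range(K):
            members = [pt for pt, lab in zip(points, labels) if lab == p]
            if members:  # an emptied cluster keeps its old centroid
                centroids[p] = mean_point(members)
    chi2 = sum(sq_dist(pt, centroids[lab]) for pt, lab in zip(points, labels))
    return labels, centroids, chi2
```

Because all distances are computed against the same fixed centroids, the assignment step is a single data-parallel operation, which is the property that makes this variant attractive on a SIMD machine.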

The neural net approach that we propose has many
similarities with the iterative scheme described above. As
will be explained later in more detail, the major difference,
however, is that the neural net allows a given pattern
to belong to several clusters until the final iteration.
That is, at least during the execution of the algorithm,
a given pattern belongs to all clusters, though with different
weights. The closest conventional method to this
is the one proposed by Gordon and Henderson [6]. In
their method, however, the sum of the weights for every
pattern is restricted to one at any given iteration; thus,
it does not possess the full flexibility of neural networks.

As for the initial cluster centroids, one may take the 
first K points of the input data, which is very simple and 
inexpensive; or if one suspects the input points are pre- 
arranged in some special way, one may choose at random 
any K points of the input data [7]. More elaborate and 
expensive methods for choosing more promising initial 
centroids have been proposed in the literature (see Ref. 
[8] and [3]). Such methods, however, are not of interest 
to us. 

OPTIMIZATION WITH NEURAL NETS 

It has been recognised in recent years that artificial 
neural networks have computational properties [9,10]. 
The Hopfield model of neural network, which we use in 
this work, is particularly suitable for solving certain op- 
timization problems. A neuron is a simple nonlinear pro- 
cessor that is connected to many (possibly all) other neu- 
rons in the network; it adds up the signals it receives 
from other neurons and fires a signal accordingly. The
state of the network, that is the firing rates or activities
of the neurons, changes with time through the interactions
of the neurons with each other, but eventually the network
settles into a steady state where the neuronal activities
remain constant. The energy of the Hopfield network is a
Lyapunov function (i.e. it does not increase with time) and
its minima are
the steady states of the network. It is this property of 
neural networks that is used in optimization. The ap- 
proach is to cast the problem in terms of an energy func- 
tion that is then minimized by the corresponding network 
as it evolves spontaneously from some randomly selected 
initial state to states of lower energy. The energy function 
has typically many minima that represent valid solutions 
to the problem; deeper minima correspond to good solu- 
tions and the deepest minimum to the best solution. 

In this paper we use analog neural nets, because they 
outperform digital nets in solving optimization problems 
[9,11]. Many problems of interest, including the problem 
we address in this paper, can be cast in terms of an energy 
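function, E, that is quadratic in the neuronal activities
and has the form [9],

$$E = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij} V_i V_j - \sum_{i=1}^{n} I_i V_i + \frac{1}{\tau} \sum_{i=1}^{n} \int_0^{V_i} dx\, g^{-1}(x). \qquad (1)$$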

Here n is the number of neurons in the network, and
$V_i$ ($0 < V_i < 1$) is the activity or firing rate of neuron
i. The first term in (1) is the interaction energy among
neurons, and the elements of the connection matrix,
$T_{ij} = -\partial^2 E / \partial V_i \partial V_j$, are completely determined from E. In the
second term $I_i$ is the bias or activity threshold of neuron
i. The third term encourages the network to operate in
the interior of the n-dimensional unit cube $\{0 < V_i < 1\}$
that forms the state space of the system. In this term $\tau$
is the self-decay time of the neurons, and g(u), a sigmoid
function, is the gain or transfer function of the neurons
that relates the input $u_i$ to the output $V_i$. A standard
form for g, which we will also use, is

$$V_i = g(u_i) = \frac{1}{2} \left( 1 + \tanh \frac{u_i}{u_0} \right) = \frac{1}{1 + e^{-2 u_i / u_0}}, \qquad (2)$$

where $u_0$ determines the steepness of the gain. The neuronal
activities, $V_i$, as well as the input signals, $u_i$, depend on
time t. The evolution of the network is determined by
the n coupled ordinary differential equations,
$du_i/dt = -\partial E/\partial V_i$, which are

$$\frac{du_i}{dt} = -\frac{u_i}{\tau} + \sum_{j=1}^{n} T_{ij} V_j + I_i. \qquad (3)$$

We will set $\tau = 1$, so that time is measured in units of
$\tau$. Note that the bias term can be eliminated from the
energy and instead incorporated into the gain function if
we define $V_i = g(u_i - \tau I_i)$.
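
As a small numerical check of the gain function, the sketch below evaluates the tanh form and the logistic form of the sigmoid and inverts it. The function names are illustrative, and the default steepness 0.1 is the value of the gain parameter quoted later in the text for the simulations.

```python
import math

U0 = 0.1  # steepness parameter u_0 (value used in the simulations described later)


def gain_tanh(u, u0=U0):
    """Sigmoid gain written with the hyperbolic tangent."""
    return 0.5 * (1.0 + math.tanh(u / u0))


def gain_logistic(u, u0=U0):
    """The same gain written as a logistic function (valid for moderate |u|/u0)."""
    return 1.0 / (1.0 + math.exp(-2.0 * u / u0))


def inverse_gain(v, u0=U0):
    """Input u that produces activity v, for 0 < v < 1."""
    return 0.5 * u0 * math.log(v / (1.0 - v))


if __name__ == "__main__":
    for u in (-0.3, -0.05, 0.0, 0.05, 0.3):
        print(u, gain_tanh(u), gain_logistic(u))
```

With a small $u_0$ the transition from 0 to 1 is sharp, so a neuron whose input is only a few tenths away from zero is already nearly fully on or off.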

To find a solution (i.e. a minimum), we start the net- 
work from a randomly selected state and let it evolve 
freely until it reaches a minimum of the function E and 
stops. As is usual in dealing with computationally in- 
tractable problems, we find not just one but several solu- 
tions by starting the network from different initial states, 
and then take the best one as the solution which may 
or may not be the optimum. Since a neural network con- 
verges rapidly to a minimum we can afford to run it many 
times thus ensuring that we find at least a very good solu- 
tion. Below, we discuss how to construct an appropriate 
network for solving this problem. 

CONSTRUCTION OF THE ENERGY FUNCTION

We want to partition a set of N points in a 2-D plane
into the best K clusters (generalisation to arbitrary dimensions
is trivial) - best in the sense that the sum of the
squares of the distances of the points from their respective
cluster centroids (i.e. the sum of "within-cluster variances")
is minimised. We formulate the problem in a manner
that can be solved by a neural network; that is, we cast
the problem in terms of an energy function that can be
minimized by the network.

The energy function will consist of two parts: (i) constraint
terms which make certain that a point, at the end of
the search, belongs to one and only one cluster; (ii) the
cost term, which is the sum of the residuals and is the
function we actually wish to minimise. The formulation
can best be illustrated through an example. Let us consider
the case where we wish to partition N = 10 points
into K = 3 clusters. A possible solution (not necessarily
the best one) would be that, say, points 1, 2, 6 and 9 belong
to cluster A, points 4 and 5 belong to cluster B, and
points 3, 7, 8 and 10 belong to cluster C. This particular
solution can be represented by the 3x10 rectangular
array given in Table 1, where the rows are labeled by the
clusters and the columns are labeled by the points. The
elements of this matrix are 0 or 1 with the interpretation
that "element A1 = 1" indicates that point 1 belongs to
cluster A, "element B1 = 0" indicates that point 1 does
not belong to cluster B, and so on.

Table 1: A possible solution for partitioning 10 points
into 3 clusters.

Cluster                   Points
            1   2   3   4   5   6   7   8   9  10
   A        1   1   0   0   0   1   0   0   1   0
   B        0   0   0   1   1   0   0   0   0   0
   C        0   0   1   0   0   0   1   1   0   1

If we think of the elements of this matrix as the activities
of neurons (n = K x N neurons altogether), and denote
them by $V_{pi}$, where p and i refer to the cluster and the
point, respectively, then the constraint part of the energy
function, E, can be expressed as

$$\frac{A}{2} \sum_{i=1}^{N} \sum_{p=1}^{K} \sum_{q \neq p} V_{pi} V_{qi} + \frac{B}{2} \sum_{i=1}^{N} \Big( \sum_{p=1}^{K} V_{pi} - 1 \Big)^2, \qquad (4)$$

where the coefficients A and B are positive constants.
The A-term has its minimum value (i.e. zero) if in each
column (representing a point) at most one neuron is active
and the rest are off. The B-term has its minimum value
(also zero) if the sum of activities in each column equals 1.
The two terms together enforce the syntax of the solution
given in Table 1.
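
The behaviour of the two constraint terms can be checked directly on the matrix of Table 1. The sketch below assumes the constraint energy has the form $(A/2)\sum_i \sum_p \sum_{q \neq p} V_{pi}V_{qi} + (B/2)\sum_i (\sum_p V_{pi} - 1)^2$; the names `TABLE_1` and `constraint_energy` are illustrative.

```python
# Rows are clusters A, B, C; columns are points 1..10 (the solution of Table 1).
TABLE_1 = [
    [1, 1, 0, 0, 0, 1, 0, 0, 1, 0],  # cluster A
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],  # cluster B
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 1],  # cluster C
]


def constraint_energy(V, A=1.0, B=1.0):
    """Return (A-term, B-term) of the constraint energy for a K x N activity matrix."""
    K, N = len(V), len(V[0])
    a_term = 0.0
    b_term = 0.0
    for i in range(N):
        column = [V[p][i] for p in range(K)]
        # A-term: products of distinct neurons in the same column.
        a_term += sum(column[p] * column[q]
                      for p in range(K) for q in range(K) if q != p)
        # B-term: squared deviation of the column sum from 1.
        b_term += (sum(column) - 1.0) ** 2
    return 0.5 * A * a_term, 0.5 * B * b_term


if __name__ == "__main__":
    print(constraint_energy(TABLE_1))      # valid syntax: both terms vanish
    broken = [row[:] for row in TABLE_1]
    broken[1][0] = 1                       # point 1 now claimed by A and B
    print(constraint_energy(broken))       # both terms become positive
```

Assigning a point to two clusters raises both terms, while leaving a point unassigned raises only the B-term, which is why both are needed to enforce the syntax.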

There is an additional constraint that we should, in 
principle, include in the energy function: that each cluster 
should contain at least one point. In terms of the solution 
matrix of Table 1 it means that in each row there should 
be at least one fully active neuron. Such a constraint can 

be imposed by a term, summed over the clusters p, built from
the step function $\Theta(x)$ of the row activities $V_{pi}$, where
$\Theta(x) = 0$ for $x < 0$ and $\Theta(x) = 1$ for $x > 0$.
However, since this term is nonanalytic its inclusion in
the energy function creates problems, and a better strategy
appears to be to leave out this term and rather reject
those solutions that violate this constraint. In our simulations
of neural networks (several thousand trials) the
solutions never violated this constraint. Therefore, it ap- 
pears that the absence of this constraint from the energy 
function is of little consequence. 

To complete the energy function we must also formulate 
the cost term. We denote the square of the distance of
point i from the centroid of cluster p (i.e. the residual)
by $R_{pi}$, which is given by

$$R_{pi} = (x_i - X_p)^2 + (y_i - Y_p)^2, \qquad (5)$$

where $(x_i, y_i)$ are the coordinates of point i, and $(X_p, Y_p)$
are the coordinates of the centroid of cluster p. Here we
have chosen the Euclidean distance as the metric; but one
can define any metric one wants. Let us consider again
the solution represented by Table 1. The sum of residuals
or the cost for this solution is

$$(R_{A1} + R_{A2} + R_{A6} + R_{A9}) + (R_{B4} + R_{B5}) + (R_{C3} + R_{C7} + R_{C8} + R_{C10}), \qquad (6)$$

which can be written as

$$\sum_{p=1}^{K} \sum_{i=1}^{N} R_{pi} V_{pi}^2. \qquad (7)$$

Hence the energy function E, including cost and constraint,
for this problem can be expressed in the final form

$$E = \frac{A}{2} \sum_{i=1}^{N} \sum_{p=1}^{K} \sum_{q \neq p} V_{pi} V_{qi} + \frac{B}{2} \sum_{i=1}^{N} \Big( \sum_{p=1}^{K} V_{pi} - 1 \Big)^2 + \frac{C}{2} \sum_{p=1}^{K} \sum_{i=1}^{N} R_{pi} V_{pi}^2, \qquad (8)$$

where C is also a positive constant. When the constraints
(or the syntax) are satisfied the A-term and the B-term
vanish and the energy function, E, reduces to just the
cost term; therefore deep minima of E correspond to good
solutions, and the deepest minimum to the best solution.

The network dynamics, obtained from $-\partial E/\partial V_{pi}$, are

$$\frac{du_{pi}}{dt} = -u_{pi} - A \sum_{q \neq p} V_{qi} - B \Big( \sum_{q=1}^{K} V_{qi} - 1 \Big) - C R_{pi} V_{pi} + I_{pi}. \qquad (9)$$

Note that (8) is only the quadratic part of the energy
function corresponding to the first term in (1), and that
the two terms $I_{pi}$ and $-u_{pi}$ in (9) come from the second
and third terms in (1), respectively.
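
The relation between the quadratic energy and the A-, B- and C-terms of the equations of motion can be verified numerically. The sketch below is an illustration under two assumptions: the energy is read as $(A/2)\sum V_{pi}V_{qi} + (B/2)\sum(\sum V_{pi} - 1)^2 + (C/2)\sum R_{pi}V_{pi}^2$, and the residuals $R_{pi}$ are held fixed while differentiating. The names `energy` and `quadratic_force` are hypothetical.

```python
def energy(V, R, A=1.0, B=1.0, C=1.0):
    """Quadratic energy (constraint plus cost) for K x N activities V and residuals R."""
    K, N = len(V), len(V[0])
    e = 0.0
    for i in range(N):
        col = [V[p][i] for p in range(K)]
        s = sum(col)
        e += 0.5 * A * sum(col[p] * (s - col[p]) for p in range(K))  # A-term
        e += 0.5 * B * (s - 1.0) ** 2                                # B-term
        e += 0.5 * C * sum(R[p][i] * col[p] ** 2 for p in range(K))  # cost term
    return e


def quadratic_force(V, R, p, i, A=1.0, B=1.0, C=1.0):
    """-dE/dV_pi with R fixed: the A-, B- and C-terms driving u_pi."""
    s = sum(V[q][i] for q in range(len(V)))
    return -A * (s - V[p][i]) - B * (s - 1.0) - C * R[p][i] * V[p][i]


if __name__ == "__main__":
    V = [[0.2, 0.7], [0.5, 0.1], [0.9, 0.3]]
    R = [[1.0, 2.0], [0.5, 0.3], [0.2, 4.0]]
    h = 1e-6
    Vp = [row[:] for row in V]
    Vm = [row[:] for row in V]
    Vp[0][0] += h
    Vm[0][0] -= h
    print(-(energy(Vp, R) - energy(Vm, R)) / (2 * h), quadratic_force(V, R, 0, 0))
```

The finite-difference derivative and the closed-form expression agree, which confirms that each term of the dynamics is the downhill direction of the corresponding energy term.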

To find a solution we assign random values between 0
and 1 to all the n = K x N neuronal activities, $V_{pi}$. Thus
the N points are partitioned into K clusters. Note that
the partitioning is not done in the proper sense that a
point belongs to a particular cluster and to no others;
rather, point i is partitioned among all the K clusters
with varying strengths that are the magnitudes of $V_{pi}$;
that is, we interpret $V_{pi}$ as the strength of the hypothesis
that point i belongs to cluster p. Hence the centroid of
cluster p is obtained from the weighted average

$$X_p = \sum_{i=1}^{N} x_i V_{pi} \Big/ \sum_{i=1}^{N} V_{pi}, \qquad Y_p = \sum_{i=1}^{N} y_i V_{pi} \Big/ \sum_{i=1}^{N} V_{pi}. \qquad (10)$$
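
The weighted centroids and the residuals that follow from them can be sketched in a few lines. The function names are illustrative, and the code assumes every cluster has a strictly positive total activity, as is the case for analog neurons with $0 < V_{pi} < 1$.

```python
def weighted_centroids(xs, ys, V):
    """Centroid (X_p, Y_p) of each cluster as the activity-weighted mean of the points."""
    centroids = []
    for row in V:  # one row of activities per cluster
        weight = sum(row)  # assumed > 0
        centroids.append((sum(x * v for x, v in zip(xs, row)) / weight,
                          sum(y * v for y, v in zip(ys, row)) / weight))
    return centroids


def residuals(xs, ys, centroids):
    """R[p][i]: squared Euclidean distance of point i from the centroid of cluster p."""
    return [[(x - X) ** 2 + (y - Y) ** 2 for x, y in zip(xs, ys)]
            for (X, Y) in centroids]


if __name__ == "__main__":
    xs, ys = [0.0, 2.0, 10.0], [0.0, 0.0, 0.0]
    V = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]  # soft membership of 3 points in 2 clusters
    cents = weighted_centroids(xs, ys, V)
    print(cents)
    print(residuals(xs, ys, cents))
```

When the activities are exactly 0 or 1 the weighted average reduces to the ordinary centroid of the points assigned to the cluster.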

As the state of the network changes with time the centroids,
as well as the residuals $R_{pi}$, also change. Starting
from this randomly selected initial state the network
evolves toward states of lower energy according to the
equations of motion (9), until it reaches a minimum energy
state and stops. The downhill motion of the network
on the energy surface is guided toward a proper solution
(one that satisfies the constraints) by the A- and B-terms
and toward solutions of good quality by the C-term. While
the network is searching for a solution the constraints are
almost surely violated, since most neurons are partially
active. Only at the end of the search, when a solution is
found, does the clustering become unambiguous. Note that
the energy E also contains other minima that do not correspond
to solutions (i.e. violate the syntax); such minima,
when found by the network, are of course rejected as
meaningless.

We remark that the cost term (7) can be written as a
linear function of activities such as $\sum_{p,i} R_{pi} V_{pi}$, which is
bias-like rather than interaction-like. However, bias-like terms
are not as effective in breaking the symmetry among the 
states that satisfy the syntax, and leave the energy land- 
scape more flat. Hence it will not be as easy for the 
network to find valid solutions as it frequently becomes 
stuck in the middle of the n-dimensional unit cube. This 
is confirmed in our simulations, where the rate of success 
for finding valid solutions drops significantly when we use 
the linear form for the cost. 

For simulations we have chosen the following values for
the parameters of the energy function: $A = B = 1$,
$C = 0.9 / R_{avg}$, all $I_{pi} = 1$, and the gain function parameter
$u_0 = 0.1$. Scaling parameter C with the average
residual $R_{avg}$ is necessary to ensure good solutions, because
as the network evolves, the residuals become generally
smaller and the cost term becomes less effective in
driving the network toward good solutions; this rescaling 
of parameter C keeps the cost term of the same order of 
magnitude as the syntax terms. 

PARALLEL IMPLEMENTATION 

We have simulated the behavior of the neural net on 
the MPP. To do this we first generate a random initial 
state {Vpi(t =0)} and then solve the equations of motion 
(9) to find which of the minima (or solutions) it converges 
to. Solutions of ordinary differential equations, such as 
the equations of motion, lend themselves very nicely to
a massively parallel computational approach. In addi- 
tion, since we want to find several solutions starting from 
different initial states - as is usual in computationally in- 
tractable problems - we run several trials at once on the 
MPP. Thus the speedup comes from parallel solution of 
the differential equations as well as running several trials 
at the same time. 

We use the Euler method [12] with a fixed time step $\delta t$
to solve the differential equations (9), i.e. we iterate the
set of n = K x N equations,

$$u_{pi}(t + \delta t) = u_{pi}(t) + \delta t \Big\{ -u_{pi}(t) - A \sum_{q \neq p} V_{qi}(t) - B \Big[ \sum_{q=1}^{K} V_{qi}(t) - 1 \Big] - C R_{pi} V_{pi}(t) + I_{pi} \Big\}, \qquad (11)$$

until the system converges to a stationary state. The only
stopping criterion we use is that the changes in the firing
rates become insignificant, i.e. that all
$|V_{pi}(t + \delta t) - V_{pi}(t)| < \epsilon$, where $\epsilon \ll 1$. After the network converges to
a solution, we must check if it is a valid solution that satisfies
the syntax, i.e. for every point i we must have one
$V_{pi} = 1$ and all the rest $V_{qi} = 0$ for $q \neq p$. In analog networks
the activity of a neuron can never become exactly
0 or 1 and can only reach close to the limits. Therefore,
if $V_{pi} < \eta_0$ we take $V_{pi} = 0$, and if $V_{pi} > 1 - \eta_1$ we take
$V_{pi} = 1$, where $\eta_0$ and $\eta_1$ are small positive numbers.
In the simulations we have chosen the following parameter
values: time step $\delta t = 10^{-3}$, convergence parameter
$\epsilon = 10^{-4}$, and the syntax parameters $\eta_0 = \eta_1 = 0.2$.
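
A serial, single-trial version of this procedure can be sketched as follows. It is a minimal illustration, not the MPP program: the name `simulate` is hypothetical, the initial inputs are obtained from the random activities through the inverse gain function, and C is rescaled by the average residual at every step, which is one reading of the rescaling described above.

```python
import math
import random


def gain(u, u0):
    """Logistic form of the sigmoid gain; the argument is clipped to avoid overflow."""
    z = max(-350.0, min(350.0, 2.0 * u / u0))
    return 1.0 / (1.0 + math.exp(-z))


def simulate(points, K, A=1.0, B=1.0, c0=0.9, I=1.0, u0=0.1,
             dt=1e-3, eps=1e-4, eta=0.2, max_steps=20000, seed=0):
    """Euler integration of the clustering network for one random initial state."""
    rng = random.Random(seed)
    N = len(points)
    V = [[rng.uniform(0.01, 0.99) for _ in range(N)] for _ in range(K)]
    u = [[0.5 * u0 * math.log(v / (1.0 - v)) for v in row] for row in V]  # assumption
    step = 0
    for step in range(1, max_steps + 1):
        # Weighted centroids and residuals for the current activities.
        cents = []
        for p in range(K):
            w = sum(V[p])
            cents.append((sum(pt[0] * v for pt, v in zip(points, V[p])) / w,
                          sum(pt[1] * v for pt, v in zip(points, V[p])) / w))
        R = [[(x - X) ** 2 + (y - Y) ** 2 for (x, y) in points] for (X, Y) in cents]
        C = c0 / (sum(map(sum, R)) / (K * N))  # rescale C by the average residual
        col_sum = [sum(V[p][i] for p in range(K)) for i in range(N)]
        new_V = [[0.0] * N for _ in range(K)]
        max_change = 0.0
        for p in range(K):
            for i in range(N):
                s = col_sum[i]
                du = (-u[p][i] - A * (s - V[p][i]) - B * (s - 1.0)
                      - C * R[p][i] * V[p][i] + I)
                u[p][i] += dt * du
                new_V[p][i] = gain(u[p][i], u0)
                max_change = max(max_change, abs(new_V[p][i] - V[p][i]))
        V = new_V  # synchronous update of all neurons
        if max_change < eps:
            break
    # Syntax check: exactly one neuron on and all others off in every column.
    labels = []
    for i in range(N):
        on = [p for p in range(K) if V[p][i] > 1.0 - eta]
        off = [p for p in range(K) if V[p][i] < eta]
        labels.append(on[0] if len(on) == 1 and len(off) == K - 1 else None)
    return labels, V, step
```

A label of `None` marks a point whose column violates the syntax, so a trial is a valid solution only if no label is `None`. Running the function with different values of `seed` corresponds to the independent trials that the MPP executes simultaneously.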

Mapping onto a SIMD parallel processor was accom- 
plished by assigning a unique processing element to each 
data point. With this requirement, all of the necessary 
operations reduce to simple array arithmetic, parallel 
sums, row and column broadcasts, and global boolean 
tests. All of these are the strong points of a massively 
parallel processor such as the MPP. Since the MPP has 
16384 processors, fewer data points allow more separate 
trials to be run in parallel. Thus, for example, the 128 
point case allowed for 128 trials with different starting 
conditions to be run at the same time. The overhead to 
the program to keep track of the different trials is trivial 
since the data movement required is straightforward and 
controlled by the programmer. The set of data points is 
replicated for each trial run in parallel. 

Each processor has stored in its memory its coordinate
values $x_i$ and $y_i$, the neuronal activities $V_{pi}$, input signals
$u_{pi}$, residues $R_{pi}$ for $p = 1, \ldots, K$, convergence indicators
for each neuron, and other ancillary information. The
processing begins with the calculation of the centroids of
each cluster according to (10). This involves a simple array
multiplication of the $x_i$ and $y_i$ by $V_{pi}$ for each cluster
$p = 1, \ldots, K$. This result is summed using the cascading
sum technique [13] and divided by the sum of $V_{pi}$ for each
cluster. These centroids are broadcast in parallel over the
remainder of the array using the MPP microcoded broadcast
primitive. This primitive, designed by Rudi Feiss
(described in [14]), is very fast, using only 231 cycles to
broadcast a row or column - 128 32-bit numbers - to
the remainder of the rows or columns of the 128x128
array. Then we calculate the residues from (5), which
involves more array arithmetic. The new input signals
$u_{pi}(t + \delta t)$ are calculated from (11) and the new activities
$V_{pi}(t + \delta t)$ are calculated from the sigmoid function
(2). These are all array arithmetic operations. A boolean
mask for each cluster is created in parallel to record where
the new activities are different from the old activities by
more than the convergence parameter $\epsilon$. A logical 'or'
(implemented as the ANY function in MPP Pascal) on
the masks determines whether the convergence criterion
has been met for all activities. This logical 'or' directly
translates into a hardware instruction on the MPP and 
thus allows simultaneous checking of conditions which on 
a serial processor would have to be done individually. Up- 
dating of all neurons for each trial was continued, regard- 
less of whether a particular trial had converged, until all 
trials had converged. Thus unnecessary bookkeeping time 
is eliminated. 
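
The global convergence test over all simultaneous trials can be mimicked serially as below. This is only a sketch of the logic: the nested-list layout `[trial][cluster][point]` and the name `any_unconverged` are assumptions made for illustration, and on the MPP the same test is a single hardware operation rather than a loop.

```python
def any_unconverged(old_V, new_V, eps=1e-4):
    """Global OR over all trials, clusters and points.

    Returns True while at least one activity still changes by eps or more,
    i.e. while the update loop over all trials must continue.
    """
    return any(abs(new - old) >= eps
               for old_trial, new_trial in zip(old_V, new_V)
               for old_row, new_row in zip(old_trial, new_trial)
               for old, new in zip(old_row, new_row))


if __name__ == "__main__":
    old = [[[0.10, 0.90], [0.90, 0.10]],   # trial 0: 2 clusters x 2 points
           [[0.50, 0.50], [0.50, 0.50]]]   # trial 1
    new = [[[0.10, 0.90], [0.90, 0.10]],   # trial 0 has stopped changing
           [[0.52, 0.50], [0.48, 0.50]]]   # trial 1 is still moving
    print(any_unconverged(old, new))       # the loop continues for every trial
```

Because a single flag covers every trial, no per-trial bookkeeping is needed: trials that have already converged are simply updated again without effect until the slowest one finishes.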

Thus the speed on the MPP is obtained from (i) the
mapping which allows most operations to be formulated 
in terms of array arithmetic, (ii) the movement of data 
among the processing elements which can be done with 
parallel algorithms, and (iii) the global boolean tests 
which are done by the machine hardware. For the case 
of 128 points to be clustered into 5 clusters, 128 trials 
were run simultaneously. This required 19 seconds per 
500 iterations. The corresponding CPU time on a VAX 
8800 was 2940 seconds (a speedup of over 150 times) , and 
21100 seconds on a VAX 11/780 (a speedup of about 1100 
times). 
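
The quoted speedups follow directly from the timings above, as the short calculation below shows. The constant and function names are illustrative; the three timings are the ones reported in the text for 128 simultaneous trials and 500 iterations.

```python
MPP_SECONDS = 19.0  # MPP time per 500 iterations, 128 trials at once
SERIAL_SECONDS = {"VAX 8800": 2940.0, "VAX 11/780": 21100.0}


def speedups(parallel=MPP_SECONDS, serial=SERIAL_SECONDS):
    """Ratio of serial CPU time to MPP time for each serial machine."""
    return {machine: seconds / parallel for machine, seconds in serial.items()}


if __name__ == "__main__":
    for machine, factor in speedups().items():
        print(f"{machine}: {factor:.0f}x")
```

The ratios are roughly 155 and 1110, consistent with the statements of "over 150 times" and "about 1100 times".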

EXAMPLES 

To study the performance of the neural net we have
tested it on some examples. In the first data set, there are
128 points divided among 5 clusters with within-cluster
Gaussian distributions (Fig. 1a). Here the 5 clusters are
rather well defined and out of the 128 trials the neural
net found the optimum clusters 128 times. The average
number of iterations for convergence was 4263; since
$\delta t = 10^{-3}\tau$, the average convergence time is about $4.3\tau$,
where $\tau$ is the decay time of a neuron. In VLSI implementations
of neural networks that are currently in
progress [1], the decay time of neurons, $\tau$, is in the range
$10^{-6}$ - $10^{-3}$ second, hence the convergence time of the
network should be in the range of a few microseconds to
a few milliseconds. Note that from numerical solution of
differential equations one can only obtain an estimate of
the actual convergence time, because the number of itera- 
tions for convergence depends on the value of the conver- 
gence parameter as well as the time step. Obviously if the 
convergence parameter is made smaller it will take more 
iterations for the network to meet the convergence crite- 
rion, resulting in a higher estimate for the convergence 
time. On the other hand if the time step is made smaller 
by, say, a factor of 10, it will take fewer than 10 times 
the number of iterations to converge, thus resulting in a 
lower estimate for the convergence time. Fig. 2 shows in 
more detail the number of iterations for the convergence 
of all the 128 trials. 
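
The conversion from iteration count to estimated convergence time is simple arithmetic, sketched here with a hypothetical helper `convergence_time`; the iteration count, time step and range of decay times are the values given in the text.

```python
def convergence_time(iterations, dt_in_tau=1e-3, tau_seconds=1.0):
    """Estimated convergence time: iterations x time step (in tau) x tau (in seconds)."""
    return iterations * dt_in_tau * tau_seconds


if __name__ == "__main__":
    print(convergence_time(4263))                    # in units of tau
    print(convergence_time(4263, tau_seconds=1e-6))  # fast analog neurons, seconds
    print(convergence_time(4263, tau_seconds=1e-3))  # slow analog neurons, seconds
```

With 4263 iterations the estimate is about 4.3 decay times, that is, a few microseconds for a decay time of one microsecond and a few milliseconds for a decay time of one millisecond.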

The conventional method of Forgy [4] in 128 trials 
found the best clusters only 46 times and various other 
solutions 82 times. The average number of iterations for 
convergence was 7. Clearly, in this example, the neural 
net outperforms the conventional method, in that it finds 
the best solution much more frequently. On the other 
hand, the conventional method takes far fewer iterations 
to converge than the neural net. But we should bear in 
mind that these are simulations of the neural net, and 
that the number of iterations needed for convergence is 
not the true measure of the processing time of the net- 
work. The convergence time of an actual analog VLSI
network must be measured in $\tau$, the characteristic time
of a neuron, which is in the micro- to millisecond range.

To test the performance of the network in cases where
clusters are fuzzy, we started from the data points of Fig.
1a, randomly selected 10% of the points and distributed
them uniformly throughout the unit square (Fig. 1b).
Thus we obtained 5 clusters with uniform background
noise. The neural net in 128 trials found the best clusters
28 times. It failed to find valid solutions satisfying the
syntax 46 times. This large number of failed solutions can
be interpreted as an indication that the clusters are fuzzy,
that there are outliers, and that perhaps the specified
number of clusters, K = 5, is too few. However, even
when the syntax is not satisfied we can extract a valid
solution with the following scheme. For each point i set
the largest $V_{pi}$ to 1 and all the other $V_{qi}$ with $q \neq p$
to 0, and interpret this solution as the one favored by
the network; thus we obtain 128 solutions. Conventional
algorithms always find valid solutions and cannot give an 
objective indication of the fuzziness of clusters. 
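
The extraction scheme for trials that violate the syntax amounts to a winner-take-all step on each column of the activity matrix. A minimal sketch, with the hypothetical name `winner_take_all`:

```python
def winner_take_all(V):
    """For each point, switch the most active neuron fully on and all others off.

    V is a K x N matrix of activities; returns the cluster label of each point
    and the corresponding 0/1 matrix.
    """
    K, N = len(V), len(V[0])
    labels = [max(range(K), key=lambda p: V[p][i]) for i in range(N)]
    hard = [[1 if labels[i] == p else 0 for i in range(N)] for p in range(K)]
    return labels, hard


if __name__ == "__main__":
    V = [[0.60, 0.30, 0.10],
         [0.50, 0.40, 0.45],
         [0.10, 0.35, 0.50]]
    print(winner_take_all(V))
```

In this example no column satisfies the syntax with thresholds of 0.2, yet each point still receives a unique cluster, namely the one the network favors most strongly.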

Similarly to Fig. 1b, we generated other data sets by
increasing the background noise to 25%, 50%, 75%, and
100% (i.e. no clusters). These data are shown in Fig.
1c-f. The results of partitioning the data among 5 clusters
obtained, in 128 trials, with the neural net and with
Forgy's method are listed in Table 2. The average estimated
convergence times for the network are given in
units of $\tau$. Two points of note in this table are: (i) As the
5 clusters become less discernible the network increasingly
fails to satisfy the syntax, indicating that the clusters
are fuzzy and that 5 clusters are not sufficient. The conventional
method, on the other hand, always finds valid
solutions; although the variety of solutions that it
finds increases (this is true of both methods), which may
be taken as a clue to the fuzziness of the clusters, it is not as
objective an indicator as the failure to satisfy the syntax.
(ii) When there are well defined clusters the neural net
performs better than the conventional technique, which is
reflected in the lower average $\chi^2$ ($\chi^2$ is the sum of
within-cluster variances) for solutions found by the neural net.
And as the clusters become fuzzier the quality of the solutions
found by the two methods becomes comparable.

Table 2: In this table the results obtained by Forgy’s 
conventional algorithm are compared with those by the 
neural network. The Data refer to data points of Fig.
1a-f. These are based on 128 trials.



Iter: is the average number of iterations for convergence. 
Best Var: is the variance of the best solution found. 
Best%: is the percentage of trials that found the best 
solution. 

Avg Var: is the average variance of the solutions found. 
Time: is the average estimated time of convergence in
units of $\tau$.

Synt%: is the percentage of trials that found solutions 
satisfying the syntax. 

In Fig. 3, we have plotted the trajectories of the centroids
of the 5 clusters as a function of time for all the 128
trials for the data of Fig. 1a. It can be seen that although
the centroids start from different places in different trials,
they all eventually converge to the same 5 points which
are the true centroids of the 5 clusters. This clearly shows
that the network succeeds, in every trial, in finding the
structure in the data. In Fig. 4, we have plotted the centroid
trajectories for the data of Fig. 1f. The spreading
of trajectories (as contrasted to the contraction of trajectories
in Fig. 3) of different trials shows that where there
is no underlying structure in the data, the network does 
not prefer any particular clustering and hence finds many 
different solutions. 

CONCLUDING REMARKS 

Preliminary results for clustering with neural networks 
are promising. The neural net appears to outperform con- 
ventional iterative techniques, when there are well defined 
clusters since it finds better solutions more frequently. 
And when clusters are fuzzy, or when the number of clus- 
ters we specify is not compatible with the structure of 
data, the neural net indicates that it cannot find valid 
solutions easily, and that something may be wrong. This 
indicator is an objective measure and hence more reliable 
than the user supplied bounds and tolerances for conven- 
tional techniques. Work on larger data sets is in progress. 

The clustering criterion we have used in this paper, 
that is minimum sum of within-cluster variances, results 
in convex compact clusters. Often clusters are not round 
or compact. By adding to the energy function appropriate
terms that favor closeness of a point to its neighbors
(and not just to the cluster centroid), one can design a 
network that finds non-convex elongated clusters of vari- 
ous shapes. 

Simulations of the neural net on the MPP for the clus- 
tering problem are two to three orders of magnitude faster 
than simulations on serial machines such as the VAX 8800 
and VAX 11/780. The speedup is due to parallel solution 
of the differential equations that govern the behavior of 
the network, as well as running several trials at the same 
time. However, the real benefit of neural nets may lie 
in the future when they can be mapped on analog chips. 
There are forecasts that analog VLSI neural nets will become
available in several years [1]. These devices will
have processing times in the micro- to millisecond range,
making their performance commensurate with human perceptual
abilities.

Fig. 1. 128 points divided among 5 clusters and respectively
0, 10, 25, 50, 75, 100 % uniform background in
a, b, c, d, e, f.



Fig. 2. Number of trials not converged versus iteration
for the data in Fig. 1a (0 % background) (loop is the
iteration number).



Fig. 3. Trajectories of the five cluster centroids in the 128
trials for the data in Fig. 1a (0 % background). Lower
left corner of Fig. 1a corresponds to back top corner in
this figure.



Fig. 4. Trajectories of the five cluster centroids for the
data in Fig. 1f (uniform distribution - 100% background).